This article is reprinted from the October 1996 issue
of Inside Solaris, a monthly publication of
The Cobb Group.
Writing a script to monitor your system

By Marco C. Mason

No matter what precautions you take, occasions always pop up when you need to drop whatever you're doing and attend to some urgent task. You just can't avoid it. As long as people use computers, computers will run out of resources. As you know, once a computer runs out of resources, such as disk or swap space, recovery can be difficult. If the system crashes, you may spend hours getting it running correctly again.

If you could keep a close eye on your system, you could find out when a catastrophe is imminent and take steps to avert it. Any system administrator would gladly spend a few minutes to prevent a multihour recovery operation. In this article, we'll show you how to build some tools that help you monitor your system and prevent major breakdowns.

What do you want to monitor?

The resources you should monitor vary depending on the applications you run and the physical parameters of the system. Some of the situations you may want a warning about include:
The monitoring script we create in this article will monitor disk space on the / and /tmp file systems, as well as available swap space and the number of processes running. Using the basic template we put together here, you can monitor as many things as you like.

Monitoring free space on a file system

One common cause of program failure is a file system running out of space. The amount of free space on a file system normally decreases as you accumulate data. You can use the df command to see how much disk space is free on all your file systems, like this:

    # df -b /
    Filesystem             avail
    /dev/dsk/c0t0d0s0     354283

As you can see, the / file system has about 350 megabytes of free space. You can examine the output of the df command to decide which file systems are too full for comfort.

For the purposes of the script that we'll put together later, we want only the number in the second column of the second line. To extract it, we pipe the results of df -b to awk, telling it that we want only the second field on the second line, like this:

    $ df -b / | awk 'NR==2 { print $2 }'
    354283

We can place this value in a variable by enclosing the expression in grave accents (`) and treating it just like a value in an assignment statement. The completed statement that puts the free space of the / file system into the TEMP variable is

    TEMP=`df -b / | awk 'NR==2 { print $2 }'`

Monitoring free swap space

Another major catastrophe occurs when the system runs out of swap space. In this case, Solaris must start killing jobs to free swap space. Since Solaris doesn't know which jobs are the most important, it may easily kill your mission-critical jobs.

If you've installed Solaris in the normal way, the swap area shares a disk slice with the /tmp file system. In this case, you don't necessarily need any special code to check for low swap space. Instead, you can simply use the check for low free space on the /tmp file system.
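As a minimal sketch of the extraction step, the filter can be wrapped in a small function and tried on canned text before you point it at a live system. The function name avail_from_df and the SAMPLE text are illustrative assumptions, not part of the original script. On a Solaris host you would pipe the output of df -b / into the function instead.

```shell
#!/bin/sh
# Hypothetical helper: print the second field of the second input line,
# which is where the df -b output shown above carries the free space.
avail_from_df() {
    awk 'NR==2 { print $2 }'
}

# Canned text copied from the df -b example above (assumed layout).
SAMPLE='Filesystem             avail
/dev/dsk/c0t0d0s0     354283'

# Same grave-accent assignment the article uses, fed from the sample.
TEMP=`printf '%s\n' "${SAMPLE}" | avail_from_df`
echo "${TEMP} free on /"
```

Keeping the parsing in one function means the same filter can be reused for /tmp, or for any other file system, by changing only the command that feeds it.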
On the other hand, if you've separated the swap space from the /tmp file system, you need a different method of finding out how much free swap space you have. In this case, you can use the swap -l command to list the swap areas, like this:

    $ swap -l
    swapfile             dev  swaplo blocks   free
    /dev/dsk/c0t0d0s1   102,1      8 131752 110784
    /extra_swap           -        8    992    992
    /extra_swap_2         -        8   1408   1408

As you can see here, this system has three swap areas, with a total of 113,184 blocks free (nearly 60MB). If you're going to write a script to monitor your swap space, all you need to do is add the amount of free space for all swap areas and compare the result to a threshold value to see if you're running dangerously low.

You can pipe the output of swap -l to a simple awk script to compute the total free space. The awk script simply adds together the values in the fifth column for all lines after the first. At the command line, type the following command to get the amount of free swap space:

    $ swap -l | awk 'BEGIN {ttl=0} NR>1 {ttl+=$5} END {print ttl}'
    113184

As you'd expect, we can place the amount of free swap space in a shell variable by enclosing the preceding expression in grave accents and making the assignment, like this:

    TEMP=`swap -l | awk 'BEGIN {ttl=0} NR>1 {ttl+=$5} END {print ttl}'`

How many processes are running?

Perhaps your system has a problem when too many processes are executing at once. If so, you may want to monitor the number of processes executing at any given time. Counting the number of active processes on the system is easy. We use the ps -A command to report all processes, one per line. Then we use wc -l to count the number of lines, as follows:

    # ps -A | wc -l
    44

So, to put the number of processes in a shell variable, we can use this command:

    TEMP=`ps -A | wc -l`

Checking your system state with a shell script

You can check for many other things, but this is a good start for our system-monitoring script.
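The swap and process measurements can be sketched the same way, as filters that read command output on standard input. The helper names swap_free_total and count_after_header and the canned samples are assumptions made for illustration. The second helper also skips the heading line that ps prints, which ps -A | wc -l counts along with the processes.

```shell
#!/bin/sh
# Hypothetical helper: add up column 5 (free blocks) on every line
# after the heading line of swap -l output.
swap_free_total() {
    awk 'NR>1 { ttl += $5 } END { print ttl+0 }'
}

# Hypothetical helper: count the lines after the heading line, so the
# heading printed by ps is not counted as a process.
count_after_header() {
    awk 'NR>1 { n++ } END { print n+0 }'
}

# Canned text copied from the swap -l example above.
SWAP_SAMPLE='swapfile             dev  swaplo blocks   free
/dev/dsk/c0t0d0s1   102,1      8 131752 110784
/extra_swap           -        8    992    992
/extra_swap_2         -        8   1408   1408'

# Made-up ps -A output with three processes (assumed layout).
PS_SAMPLE='   PID TTY      TIME CMD
     0 ?        0:01 sched
     1 ?        0:00 init
     2 ?        0:00 pageout'

FREE_BLOCKS=`printf '%s\n' "${SWAP_SAMPLE}" | swap_free_total`
NPROCS=`printf '%s\n' "${PS_SAMPLE}" | count_after_header`
echo "${FREE_BLOCKS} blocks of swap space left"
echo "${NPROCS} processes currently running"
```

On a live system, swap -l and ps -A would replace the two printf commands. Whether the extra heading line matters depends on how tight your process limit is.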
Once we've obtained the information we want, we use basically the same structure to determine whether the system is in trouble. We use an if statement to see whether we've violated the limit. If we have, we append a warning message to a report file and set the STATUS variable to 1, as shown in Listing A. The lines that begin with cat use a here document, as described in the article "Automating Applications that Accept User Input" in the June issue. These lines add a failure warning record to the file specified by REPORT.

Finally, after the script checks all parameters, it decides whether to send E-mail and page the system administrator. It then deletes the temporary file it used to build the mail message. (An early version of the ISOL_Monitor script inadvertently tested itself: we forgot to delete the temporary file, and eventually the script told us that the /tmp file system was too full!)

    if [ ${STATUS} -gt 0 ]; then
        mail ${SYSADMIN} <${REPORT}
        cu pgr_${SYSADMIN} >/dev/null
    fi
    rm -f ${REPORT}

Please note that for our purposes, we're assuming you've created a paging system named pgr_SysAdmin, where SysAdmin is the username of your system administrator.

Listing A

if [ ${TEMP} -lt ${MIN_ROOT_SPC} ]; then
    echo " Not enough!"
    cat <<- XYZZY >>${REPORT}
Insufficient space on / (${TEMP} < ${MIN_ROOT_SPC})
XYZZY
    STATUS=1
fi

We use this ISOL_Monitor structure throughout to warn the user about potential problems.
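Because every check follows the same pattern, the comparison and the report record can be factored into functions. The sketch below is one possible arrangement, not the article's own code: check_min, check_max, the report file name, and the readings passed to them are all hypothetical, and it prints the report where the real script would mail it.

```shell
#!/bin/sh
# Sketch only: check_min and check_max are hypothetical helpers and are
# not part of the ISOL_Monitor script.
STATUS=0
REPORT=${TMPDIR:-/tmp}/monitor_sketch_$$
rm -f "${REPORT}"

# check_min LABEL VALUE LIMIT: record a warning when VALUE is below LIMIT.
check_min() {
    if [ "$2" -lt "$3" ]; then
        echo "Insufficient $1 ($2 < $3)" >>"${REPORT}"
        STATUS=1
    fi
}

# check_max LABEL VALUE LIMIT: record a warning when VALUE is above LIMIT.
check_max() {
    if [ "$2" -gt "$3" ]; then
        echo "Too many $1 ($2 > $3)" >>"${REPORT}"
        STATUS=1
    fi
}

# Made-up readings and limits standing in for the df, swap, and ps results.
check_min "space on /" 354283 1000000
check_min "swap space" 113184 100000
check_max "processes" 44 100

# Print the report here; the real script mails it and pages the sysadmin.
if [ ${STATUS} -gt 0 ]; then
    WARNINGS=`cat "${REPORT}"`
    echo "${WARNINGS}"
fi
rm -f "${REPORT}"
```

With these readings only the first check fails, so the report holds a single record for the / file system. Adding another resource then takes one more function call rather than another copy of the if block.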
As you can see in Listing B, we set obviously bad limits so that you can watch the script send you E-mail and page you. Also note that you need to change the SYSADMIN variable to your username. Once you install the script on your system, just tune these parameters to values that suit your needs.
Listing B

#! /usr/bin/ksh
#------------------------------------
# Monitor system statistics, and warn
# sysadmin(s) of any impending probs.
#------------------------------------

# CONFIGURATION
MIN_ROOT_SPC=1000000
MIN_TEMP_SPC=2000000
MIN_SWAP_SPC=1000000
MAX_PROCS=3
SYSADMIN=marco
PATH=/usr/sbin:/usr/bin

# By default, we're not going to send a
# page, or any E-Mail
STATUS=0
REPORT=/tmp/ISOL_Monitor_${$}
# Remove any stale report left by an earlier run
rm -f ${REPORT}

# Is there enough space on /?
TEMP=`df -b / | awk 'NR==2 { print $2 }'`
echo ${TEMP} "blocks left on /"
if [ ${TEMP} -lt ${MIN_ROOT_SPC} ]; then
    echo " Not enough!"
    cat <<- XYZZY >>${REPORT}
Insufficient space on / (${TEMP} < ${MIN_ROOT_SPC})
XYZZY
    STATUS=1
fi

# Is there enough space on /tmp?
TEMP=`df -b /tmp | awk 'NR==2 { print $2 }'`
echo ${TEMP} "blocks left on /tmp"
if [ ${TEMP} -lt ${MIN_TEMP_SPC} ]; then
    echo " Not enough!"
    cat <<- XYZZY >>${REPORT}
Insufficient space on /tmp (${TEMP} < ${MIN_TEMP_SPC})
XYZZY
    STATUS=1
fi

# Is there enough swap space?
TEMP=`swap -l | awk 'BEGIN { total=0 } NR>=2 { total += $5 } END { print total }'`
echo ${TEMP} "blocks of swap space left"
if [ ${TEMP} -lt ${MIN_SWAP_SPC} ]; then
    echo " Not enough!"
    cat <<- XYZZY >>${REPORT}
Insufficient swap space (${TEMP} < ${MIN_SWAP_SPC})
XYZZY
    STATUS=1
fi

# Are there too many processes running?
TEMP=`ps -A | wc -l`
echo ${TEMP} "processes currently running"
if [ ${TEMP} -gt ${MAX_PROCS} ]; then
    echo " Too many!"
    cat <<- XYZZY >>${REPORT}
Too many processes! (${TEMP} > ${MAX_PROCS})
XYZZY
    STATUS=1
fi

# If we've detected any bad problems,
# E-Mail the report to the sysadmin
# and then issue a page
if [ ${STATUS} -gt 0 ]; then
    mail ${SYSADMIN} <${REPORT}
    cu pgr_${SYSADMIN} >/dev/null
fi
rm -f ${REPORT}

The ISOL_Monitor script monitors your system and alerts you when a resource is critically low.
Copyright (c) 1996 The Cobb Group, a division of Ziff-Davis Publishing Company. All rights reserved. Reproduction in whole or in part in any form or medium without express written permission of Ziff-Davis Publishing Company is prohibited. The Cobb Group and The Cobb Group logo are trademarks of Ziff-Davis Publishing Company.